Recognition of the Script in Serbian Documents Using Frequency Occurrence and Co-Occurrence Analysis

Authors

Darko Brodić

Zoran N. Milivojević

Čedomir A. Maluckov

Abstract

Any document in the Serbian language can be written in either of two scripts: Latin or Cyrillic. Although the characteristics of these scripts are similar, some of their statistical measures differ considerably. This paper proposes a method for identifying the script of a document from the occurrence and co-occurrence of script types. First, each letter is assigned a script type according to its position relative to the baseline area. Then, a frequency analysis of script-type occurrence is performed. Because the Latin and Cyrillic scripts differ, the occurrence of the modeled letters shows substantial statistical dissimilarity. Furthermore, a co-occurrence matrix is computed. Analysis of the co-occurrence matrix yields a strong margin that serves as a criterion for distinguishing and recognizing the script. The proposed method is evaluated on a database that includes different types of printed and web documents. The experiments gave encouraging results.
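The pipeline described in the abstract (code each letter by its vertical position, count how often each code occurs, then count adjacent code pairs) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-way coding (`A` ascender, `B` base, `D` descender), the letter tables, and all function names are assumptions made for the example, and only unaccented Latin letters are coded here. A Cyrillic text, or Latin letters with diacritics, would need its own table.

```python
from collections import Counter

# Assumed toy tables for unaccented Latin letters only (not taken from the paper).
ASCENDERS = set("bdfhklt")   # lowercase letters rising above the x-height
DESCENDERS = set("gjpqy")    # lowercase letters dropping below the baseline


def letter_type(ch):
    """Return an assumed script type: 'A' ascender, 'D' descender, 'B' base."""
    if ch.isupper() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "D"
    return "B"


def encode(text):
    """Code every ASCII letter; other characters are skipped in this sketch."""
    return [letter_type(ch) for ch in text if ch.isascii() and ch.isalpha()]


def type_frequencies(codes):
    """Relative frequency of each script type in the coded sequence."""
    counts = Counter(codes)
    total = len(codes)
    return {t: n / total for t, n in counts.items()} if total else {}


def cooccurrence(codes):
    """Counts of adjacent (left, right) script-type pairs.

    Pairs are taken over the whole letter sequence, so word boundaries
    are ignored; that is a simplification of this sketch.
    """
    return Counter(zip(codes, codes[1:]))


codes = encode("Beograd")
freqs = type_frequencies(codes)
pairs = cooccurrence(codes)
```

Under these assumptions, two scripts would be compared by computing `freqs` and `pairs` for a document with each script's own coding table and checking which reference profile the document resembles more closely. The decision criterion itself is not specified in the abstract, so it is left out here.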


Similar resources

Global Approach for Script Identification using Wavelet Packet Based Features

In a multi script environment, an archive of documents having the text regions printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documen...


Wavelet Packet Based Texture Features for Automatic Script Identification

In a multi script environment, an archive of documents printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in ten Indian scripts ...


Mapping the Scientific Structure of Iranian Brucellosis Researches Using the Co-authorship and Co-occurrence Network Analysis

Background and Objective: The evaluation of the publishing trend of articles in various scientific fields provides an insight into the efforts of researchers in the field of knowledge. Accordingly, the present study has evaluated and analyzed the scientific publications on brucellosis conducted by Iranian researchers using scientometrics methods and analysis of social networks. Methods: The pr...


Drawing Word co-occurrence map of Spinal Muscular Atrophy disease

Introduction: The purpose of this article is to evaluate the status of articles in the field of Spinal Muscular Atrophy According to the Scientometrics indices Word co-occurrence map of this field . Methods: The present study is an applied one with a quantitative approach and a descriptive approach. It has been done using scientometrics and the co-occurrence words analysis technique. Document...


The analysis of co-citation and word co-occurrence networks of Iranian articles in the field of dentistry

Background and Aims: Dentistry is an important profession ensuring the health of body and soul, and has a special place in the scientific productions of medical disciplines. The purpose of this study was to analyze the co-citation and word co-occurrence of Iranian research papers in the field of dentistry based on indexed documents in Web of Science from 2014 to 2018. Materials and Methods:...


Journal title:

Volume 2013, issue

Pages -

Publication date: 2013
